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Abstract 



c^ ' Consider a linear regression model. Fan and Li (2001) describe the smoothly 

clipped absolute deviation (SCAD) point estimator of the regression parameter vec- 

O 

cn . tor. To gain insight into the properties of this estimator, they consider an orthonor- 

Y •■ mal design matrix and focus on the estimation of a specified component of this 

C/^ ! vector. They show that the SCAD point estimator has three attractive properties. 

■S 



We answer the question: To what extent can an interval estimator, centred on the 
SCAD estimator, have similar attractive properties? 
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1. Introduction 

Consider tlie linear regression model Y = XP + e, where y is a random n- vector 
of responses, X is a known n x p design matrix with linearly independent columns, 
P is an unknown j9- vector and e ~ N{0,cr'^ln), where o"^ is an unknown positive 
parameter. In a widely-cited paper. Fan and Li (2001) describe a point estimator 
of /3 that they call the smoothly clipped absolute deviation (SCAD) estimator. The 
SCAD point estimator is designed to perform especially well when most of the 
components of /3 are believed to be zero (a sparsity type of assumption). 

In section 2 of Fan and Li (2001), to gain insight into the properties of this point 
estimator, the authors focus on the estimation of /3j (where i is specified) for the 
case that the columns of X are orthonormal (cf section 2.2 of Tibshrani, 1996 and 
Potscher and Schneider, 2010). This is the scenario that we consider throughout 
the present paper. Let sign(a;) be equal to —1 for a; < 0, for a; = and 1 for a; > 
and let x+ = max{a;, 0}. Let f3i denote the least squares estimator of /3j. Also let 
S^ denote the usual unbiased estimator of o"^. The SCAD estimator of /3j is 

rsign(A)(|A|-A)^ if m<2\ 

Pi =\[{a- l)/3i - sign(/3i)aA)/(a - 2) if 2A < lA] < a\ 

[a if \h\ > aX 

We adopt the proposal of Fan and Li (2001) that a = 3.7. For the purpose of gaining 
insight into the properties of the SCAD estimator, these authors suppose that (a) 
a is known to be 1 and A is a specified fixed value when they consider the mean 
square error (m.s.e.) of this estimator and (b) A = A„ is a non-random sequence 
that depends on n when they consider what they call the oracle property. In the 
present paper, for the purpose of gaining insight into the properties of confidence 
intervals centred on the SCAD estimator, we suppose that A = S ?] where r^ is a 
specified positive number. 

To assess the SCAD point estimator, assume (for the moment) that a is known 
and that X = arj, where ?7 is a specified positive number. We assess the SCAD 
point estimator by the ratio (m.s.e. of the SCAD estimator)/(m.s.e. of least squares 
estimator), which we call the scaled m.s.e.. This point estimator has the following 
attractive properties: 

(PI) It is a continuous function of the data. 



(P2) The scaled m.s.e. converges to 1 as \Pi/a\ — )> oo. 

(P3) The scaled m.s.e. is substantially less than 1 when Pi = 0. 

Now consider interval estimation of /3j. We assess a 1 — a confidence interval 
J for Pi by the ratio ii^(length of J)/i?(length of usual 1 — a confidence interval), 
which we call the scaled expected length. The corresponding attractive properties 
for a 1 — q; confidence interval for Pi are the following: 

(11) The endpoints of J are continuous functions of the data. 

(12) The scaled expected length converges to 1 as |/3j/o"| — )■ oo. 

(13) The scaled expected length is substantially less than 1 when Pi = 0. 

Farchione and Kabaila (2008) have already found 1 — a confidence intervals that 
possess all of these attractive properties. These intervals also have the appealing 
property that the maximum of the scaled expected length is not too large. The 
centres of these interval estimators do not resemble a SCAD estimator. This suggests 
that a 1 — a confidence interval centred on the SCAD estimator will not be able to 
have all of the attractive properties (II), (12) and (13). 

The SCAD estimator of Pi reverts to the least squares estimator Pi when |/3i| > 
aJ^r]. We consider a 1 — a confidence interval (for Pi) centred on this SCAD es- 
timator that, similarly, reverts to the usual 1 — a confidence interval for Pi when 
\Pi\ > aUrj. This confidence interval has the attractive property (12). We will 
also construct this confidence interval to have the attractive property (II). We ask 
the following question. To what extent can this confidence interval, centred on the 
SCAD estimator, have the property (13)? Let m = n — p. In Section 3, we consider 
1 — a = 0.95 and the cases (a) m = 200 (moderately large m) and rj = 0.5, 1,2 
and (b) m = 3 (small m) and rj = 0.5, 1, 2. In each of these cases, we show numeri- 
cally that this confidence interval, centred on the SCAD estimator, cannot have the 
property (13). This suggests that this confidence interval cannot have this property 
more generally. 

The SCAD point estimator may be viewed as being obtained from Pi, by a 
modification determined by |/3i|/S. Such a modification seems reasonable because 
|/3j|/S may be viewed as a test statistic for testing the null hypothesis Pi = 



against the alternative hypothesis /3j 7^ 0. In the present paper, we consider interval 
estimators centred at this SCAD estimator, with width 2S s(|/3j|/S), where the 
function s is quite flexible (the constraints on this function are specified in the next 
section). This width may be viewed as a modification of a given (non-random) 
multiple of S, by a modification determined by |/3i|/S. We use a finite-sample 
analysis of this confidence interval; we do not use any asymptotic approximations. 
To assume that a^ is known is effectively equivalent to assuming that n — p is large; 
we do not assume that o"^ is known. We require only that n — p > 1. In related 
work, Potscher and Schneider (2010) consider confidence intervals that include in 
their interior the hard-thresholding, LASSO (or soft thresholding) and adaptive 
LASSO estimators. However, these intervals are constrained to have a width that 
is a given (non-random) multiple of S (or a in the case that they assume that o"^ 
is known). So, the analysis carried out by Potscher and Schneider (2010) is quite 
different from the analysis presented in the present paper. 

2. The form of the confidence interval centred on the SCAD estimator 

Define the quantile t(m) by the requirement that P(— t(m) <T< t{m)) = 1 — a 
for T r^ tm- The usual 1 — a confidence interval for /3j is 

/= [A-t(m)S, A + t(m)S]. 

We consider the following confidence interval for /3j, centred at the SCAD estimator 

ft: 

J(s) = [ft - S s(|ft|/S), ft + S s(|ft|/S)] , 

where s : (0, 00) — ?> (0, 00) is a continuous function that satisfies s{x) = t{m) for all 
X > k, where k = arj = 3.71]. This confidence interval has the attractive properties 
(II) and (12). Farchione and Kabaila (2008) consider Xi, . . . ,X„ independent and 
identically N{fi, a^) distributed. They consider confidence intervals of the form 

:-(-f);-(f 

where X = n~^ J27=i -^i^ ^^ ~ (n — 1)~^ J27=ii-^i~-^)^ ^^^ c is a function satisfying 
c{x) > —c{—x) for all a; G M (so that the upper endpoint is always greater than or 
equal to the lower endpoint). It may be shown that J{s) has a similar form 

-Lc I — - 




Theorem 1 of Kabaila (2011) implies that if s is chosen such that J{s) is a 1 — a 
confidence interval, with scaled expected length less than 1 when /3j = 0, then the 
maximum value of the scaled expected length of J{s) must be greater than 1. 

The question that we ask is whether or not we can find a function s such that 
J{s) has the property (13). We do this by minimizing the scaled expected length of 
J{s) when f3i = 0, subject to the constraint that the coverage probability of J{s) 
never falls below 1 — a. 

3. Numerical results 

As noted in Appendix A, the scaled expected length and the coverage probability of 
J{s) are even functions of ^ = f3i/a. Let e{6; s) denote the scaled expected length 
of J{s). To minimize the scaled expected length of J{s) when ^ = (which is 
equivalent to f3i = 0), subject to the constraint that the coverage probability of J{s) 
never falls below 1 — a, we use the computationally-convenient expressions described 
in Theorem 1 (stated and proved in Appendix A). In Appendix B, we describe briefiy 
how the coverage probability of J{s) is computed using this theorem. 

For computational tractability, we have chosen the function s to be a natural 
cubic spline with equally-spaced knots in the interval [0, k] (with a knot at and 
a knot at k). Remember, k = at] = 3.71]. Let these knots be denoted xi, . . . ,Xq, 
where Xi = and Xq = k. Since we require that s{xq) = t{m), the objective function 
and the constraints for the constrained minimization problem that we consider are 
functions of the q — 1 variables s(xi), . . . , s{xq^i). 

Suppose that 1 — a = 0.95. For m = 200 (moderately large m) and m = 3 
(small m) and for rj = 0.5,1,2, we have computed the function s (specified by 
s{xi), . . . , s{xq^i)) that minimizes e(0; s), subject to the constraints that (a) s{x) > 
for all X G [0, k] and (b) the coverage probability of J{s) never falls below 1 — a. Let 
s* denote this constrained minimizing value of the function s. The properties of s* 
are summarized in the Tables 1 and 2 and Figures 1 and 2, below. The function s* 
depends on 1— a, m, k and Xi, . . . , Xq^i. For notational convenience, this dependence 
is left implicit. 

We implement this coverage constraint in the computations as follows. It may be 
shown that, for any reasonable choice of the function s, the coverage probability of 
J{s) converges to 1 — a as ^ — ?■ oo. The constraints implemented in the computations 



are that the coverage probabihty of J{s) is greater than or equal to 1 — a for every 
6 in a judiciously-chosen finite set of values. That a given finite set of values of 
6 is adequate to the task is judged by checking numerically, at the completion of 
computations, that the coverage probability constraint is satisfied for all ^ > 0. 

Table 1 presents some properties of this constrained minimizing function s* for 
the case that m = 200 and rj = 0.5, 1, 2. The number of knots of the cubic spline s 
in the interval [0, k] was chosen to be 4, 5 and 6 for each rj. Observe that, for each 
value of 1] considered, e(0; s*) is a decreasing function of the number of knots and 
that the decrease from 5 to 6 knots is small. This table shows that the confidence 
interval J{s), which is centred on the SCAD estimator, cannot possess the property 
(13) for m = 200 and these values of rj and numbers of equally-spaced knots. 



Table 1: Some properties of the constrained minimizing function s* for m 
and 1] = 0.5, 1 and 2. 

ri = 0.5 



200 



number of knots 


4 


5 


6 


e(0;s*) 


1.1609 


1.1274 


1.1250 


maxQe{6; s*) 


1.1609 


1.1274 


1.1250 



number of knots 


4 


5 


6 


e(0;s*) 


1.2940 


1.2826 


1.2825 


m.axoe{9; s*) 


1.3936 


1.3821 


1.3748 



number of knots 


4 


5 


6 


e(0;s*) 


1.2181 


1.2155 


1.2154 


m£iX0e{9; s*) 


2.1045 


5.5869 


5.5272 



Table 2 presents some properties of the constrained minimizing function s* for 
the case that m = 3 and rj = 0.5, 1, 2. The number of knots of the cubic spline s 
in the interval [0, k] was chosen to be 4, 5 and 6 for each rj. Observe that, for each 
value of 1] considered, e(0; s*) is a decreasing function of the number of knots and 
that the decrease from 5 to 6 knots is small. This table shows that the confidence 
interval J{s), which is centred on the SCAD estimator, cannot possess the property 
(13) for m = 3 and these values of rj and numbers of equally-spaced knots. 



Table 2: Some properties of the constrained minimizing function s* for m = 3 and 
f] = 0.5, 1 and 2. 

ri = 0.5 



number of knots 


4 


5 


6 


e(0;s*) 


1.0526 


1.0519 


1.0511 


maxge{9; s*) 


1.0759 


1.0782 


1.0796 



number of knots 


4 


5 


6 


e(0;s*) 


1.0977 


1.0966 


1.0950 


maxge{6; s*) 


1.3216 


1.3385 


1.3464 





r] = 2 






number of knots 


4 


5 


6 


e(0;s*) 


1.0824 


1.0815 


1.0788 


m.a.xge{6; s*) 


2.0858 


2.1650 


2.1193 



We now examine the properties of the constrained minimizing function s* in more 
detail for the case that m = 200, rj = 1 and the cubic spline s has 6 equally-spaced 
knots in the interval [0, k]. The top panel of Figure 1 is a plot of the scaled expected 
length e{6; s*) as a function of 6. This plot illustrates the fact that every confidence 
interval of the form J{s) possesses the attractive property (12). The bottom panel 
of this figure is a plot of the coverage probability of J{s*) as a function of 6. It is 
notable that this coverage probability is far above 0.95 for 6 G [0, 1]. We would like 
to be able to choose the function s so as to "trade" this high coverage probability 
for a small scaled expected length at ^ = 0. Evidently, using a confidence interval 
of the form J{s), centred on the SCAD estimator, does not allow this "trade" to 
occur. This is in sharp contrast to the confidence interval of Farchione and Kabaila 
(2008), which has coverage probability equal to 1 — a throughout the parameter 
space. It appears that by allowing their confidence interval to have a flexible centre, 
Farchione and Kabaila (2008) have allowed this "trade" to occur, resulting in a 
confidence interval that possesses all of the attractive properties (II), (12), (13) and 
maximum scaled expected length that is not too large. Figure 2 is a plot of the 
constrained minimizing function s* for this case. The knots of the cubic spline are 
denoted by small circles. 
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Figure 1: Properties of the constrained minimizing function s* for m = 200, 77 = 1 
and the cubic sphne s having 6 equally-spaced knots in the interval [0, k]. The top 
panel is a plot of the scaled expected length e{9; s*) as a function of 9. The bottom 
panel is a plot of the coverage probability of J{s*) as a function of 6. 




Figure 2: Plot of the constrained minimizing function s* for the case that m = 200, 
7] = 1 and the cubic sphne s has 6 knots in the interval [0, k]. 



4. Discussion 

To gain insight into tlie properties of the SCAD estimator, Fan and Li (2001) 
consider an orthonormal design matrix and focus on the estimation of a specified 
component of the regression parameter vector. We do the same, to gain insight into 
the properties of confidence intervals centred on the SCAD estimator. We consider 
1 — a confidence intervals centred on this SCAD estimator that revert to the usual 
\ — a confidence interval for the same data values as the SCAD estimator reverts 
to the least squares estimator. Our numerical results strongly suggest that these 
confidence intervals cannot be constructed to have the the important property (13). 
By contrast, in the context of a multivariate normal mean, the positive-part James- 
Stein point estimator dominates the usual estimator of this mean (using a sum of 
squared errors loss function). As shown by Casella and Hwang (1983), a sphere 
(with data-dependent radius) centred on this point estimator can be constructed so 
as to dominate the usual confidence set for this mean. 

Appendix A: Computationally convenient expressions for the scaled ex- 
pected length and the coverage probability of J(s) 

Define W = S/c and let fw denote the probability density function of W. Let 
e = /3i/a. Define 

{sign(a;) (|a;| — i]), if |a;| < 2r] 

((a — l)x — sign{x)aT])/{a — 2) if 2r] < \x\ < arj 
X if \x\ > at]. 

We use the notation 

_, ,, I 1 if ^ is true 
X{A) = < 

10 if ^ is false 

where A is an arbitrary statement. This is similar to the Iverson bracket notation 

(Knuth, 1992). Computationally convenient expressions for the scaled expected 

length and coverage probability of J{s) are provided by the following result. 

Theorem 1. (a) The scaled expected length of J{s) is equal to 

, {s{\x\) - t{m)) (j){wx-e)w'^ fw{w)dwdx. (1) 

t{m}tij{vv } j_k Jo 

For given function s, the scaled expected length of J{s) is an even function of 9. 

10 



(b) The scaled expected length of J{s) evaluated at 6 = is 

1 + — -^TT^TTTTT / (six) - t(m)) i — dx. 2 

(c) Define 

JO if niax(— t(m)u;, —kw — 6) > min(t(m)t(;, kw — 6) 

\(^ yrm'n.{t{ra)w , kw — d)) — (^ y\iva,'x{—t{m)w , —kw ~ d)) otherwise 

where $ denotes the A^(0, 1) cumulative distribution function. The coverage proba- 
bility of J{s) is equal to 

"k ^oo / f) \ 

1 1 h{x) — s(|s|) < — < h{x) + s(|a;|) ) <^{wx — 6)w fwi^j) dwdx 



kJO \ w / 

+ 1 — a — / b{w;m,d,9)fwiw)dw (3) 

Jo 

where denotes the N{0, 1) probability density function. For given functions h and 
s, this coverage probability is an even function of 9. 

Proof of part (a) 

The scaled expected length of J{s) is defined to be 

expected length of J(s) 



expected length of / 

This is equal to 

E{s{\Q\/W)W) 

tim)E{W) ^ ' 

where B = /3i/cr. It follows from Theorem 1(b) of Kabaila and Giri (2009) that (jl]) 

is equal to ([1]). 

Proof of part (b) 

It follows from ([T]) that the scaled expected length of J{s) evaluated at 6* = is 
1 '■^ 



Now 



1 + , / X ^/yx.x / {s(\x\) - t(m)) (f){wx)w^ fw{w)dwdx. (5) 

t{m)E{W) J_k Jo 



2rrf^/'^ 1 f°° ( \ \ 

(\){wx)w^ fw{w)dw= io\om/2 / ^^""^^ exp -- (m + x^) w^ dw 

y Ztx i \rai A)L ' Jq \ / / 

where F denotes the gamma function. By (A2.1.3) on p. 144 of Box and Tiao (1973), 

this is equal to 

-, / N (m/2)+l 



^2tt yx"^ + m 
11 



([2]) follows from this and ([5]). 

Proof of part (c) 

The coverage probability of J{s) is equal to 

p(a - ts{m/t) < A < A + S s(|A|/S)). (6) 

By the law of total probability, this is equal to 

p(a - ts{m/t) <f3,<p, + ts{m/t), lAi <kt^ 

+ P{^h - Ss(|A|/S) < A < /3. + S s(|A|/S), lAI >k€). 
The second term in this sum is equal to 

p(A - t(m) S < A < A + t(m) % lAI > A;S). 
By the law of total probability, this is equal to 

l-a-P(i3i-t{m)t <(3i < A + t(m) S, |/3i| < kt\ 
Thus (^ is equal to 

1 - « + p(a - s s(|A|/s) < A < A + ts{m/t),\p,\ <kE^ 

- P (A - t{m) S < A < A + t(m) S, \Pi\<k S^ 
This is equal to 

i-a + p(e-ws{\Q\/w) <e<e + ws{\e\/w),\e\ <kw^ 
-p(&-t{m)w <e<e + t{m)w,\e\ <kwy 

where G) = Pi/cr, O = A/cr, = A/a and PF = E/cr. It may be shown that 

{sign(e)(|0|-W^r7)^ ii\e\<2Wr] 

{{a - 1)9 - sign{Q)aWr])/{a - 2) if 2r] < \Q\ < ar] 
e if |e| > aWr]. 

Now define the function g by G = g{Q,W). Thus 

p(e-ws{\<d\/w) <e< & + ws{\e\/w), \e\ <kw^ 

= Efl{g{e,W) -Ws{\<d\/W) <e< g{&,W) + W s{\<d\/W))l{\<d\ < kW) 



oo pkw 

l[g{x,w) — w s{\x\/w) < < g{x,w) — iv s{\x\/w)) 0(x — 9) fwi'w) dxdw. 

kw 

(7) 
12 



Now change the variable of integration of the inner integral to y = x/w. Thus ([7]) 
is equal to 

/ ^{g{wy,w)-ws{\y\) <e < g{wy,w) - w s{\y\)) (j){wy - 9) w fw{w) dy dw. 

J-k 

(8) 
It may be shown that g{wy,w) = wh(y). Thus ([8]) is equal to 



oo pk 



I ( h{x) — s{\x\) < — < h{x) + s{\x\) ) (j){wx — 9) w fw{w) dxdw 



J-k 



W 



k /"OO 



-fc Jo 



I ( h{x) — s(|a;|) < — < /i(a;) + s{\x\) ) ^(wa:; — 9)w /iy(w) dwdx. (9) 



lu 



Now 

p('e-t(m)vr<0<0 + t(m)vr, |e| <rfvr' 

= P('-t(m)l^< Z<t(m)lV, |Z + e| <rfl^) (10) 

where Z = 6 — 6*, so that Z ~ A^(0, 1). Observe that (TTUl) is equal to 

/ P(Z G [-t{m)w,t{m)w] fl [-(iw -6',(iM;-6']) fw{w)dw. 
Jo 

It may be shown that this is equal to 

b{w; m, d, 9) fwiw) dw. 



Thus the coverage probability is equal to ([3j). Now Q and flTOj) may be shown to 
be even functions of 9. It follows that the coverage probability is an even function 
oi9. 

Appendix B: Computation of the coverage probability 

By Theorem 1(c), for given functions h and s, the coverage probability of J{s) 
is an even function of 9. Consequently, we only need to compute this coverage 
probability for ^ > 0. To compute this coverage probability using ([3]), we need to 
compute 

(9 \ 

h{x) - s(|x|) < - < h{x) + s{\x\) J (t){wx -9)w fw{w) dw (11) 

for given x E [—d, d] . We consider the following 2 cases. 
Case 1:6 = 

13 



In this case 



^/,,, ,, ,, e ,,, ,, ,A 1 if /i(x) -s(|x|) <0 and /i(a;) + s(|x|) > 

X[h{x)-s{\x\)<-<h{x) + s{\x\)]={ y, ^' '^- ^' ^''^- 

\ w / I otherwise. 

Thus 

j /g°° (f){wx) w fwi^) dw if h{x) — s(|x|) < and h{x) + s{\x\) > 
lo otherwise. 

We find a convenient expression for 

) 

(t){wx)w fwiw) dw (12) 



'0 

as follows. Substituting the formulae for and fw into f lT^ . we find that 
1 2m™/2 /-oo _ / 1 



I2^ = ^; r(W2)2W2 / ^"^exp(-^(a:^ + mV^)d^. (13) 



By (A2.1.3) on p.l44 of Box and Tiao (1973), 

1 r((m + l)/2) ( m y/^ 1 



m 



/? r(m/2) \x'^ + mj ^x^ + m 
Case 2: 6» > 

Subcase (a): h{x) — s{\x\) > and h{x) + s(|a3|) > 
In this subcase, 

/ n \ f f) f) 

I h{x) - s{\x\) < - < h{x) + s{\x\) = X — — < ^ < 



w J \h{x) + s{\x\) h{x) — s{\x\) 

Thus, in this subcase, 

f ITT]) = / (j){wx — 9)w fw{w) dw. 

Je/{h(x)+s(\x\)) 

Subcase (b): h{x) — s{\x\) < and h{x) + s{\x\) > 
In this subcase, 

X Ihix) - s{\x\) <- < hix) + si\x\)\ = X ( < - < hix) + s(\x\)\ since - > 
\ w J \ w J w 

Thus, in this subcase, 

/■oo 

flTTl) = / (j){wx — 6)w fw{w) dw. 

Je/{h{x)+s(\x\)) 

14 



Subcase (c): h(x) — s(\x\) < and h(x) + s(|a;|) < 

In this subcase, 

(0 \ 6 

h(x) - s(\x\) <- < hix) + s{\x\) ] = since - > 0. 
w J w 

Thus, in this subcase, (TTTj) = 0. 
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